Introduction

The objective of this document is to provide a sample of my coding as I perform a variety of data handling tasks. I use the light topic as a

Extracting Place Data from Lyrics

The first step in the analysis is to extract location names from the unstructured text of the song lyrics.

To do this systematically and efficiently, I will rely on three common-sense assumptions about the lyrics:

  1. Although other aspects of the song are repetitive, lines with place names are not repeated.

  2. Unlike most other words in the lyrics, place names are capitalized.

  3. Place names will not include generic stop words.
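To make the second assumption concrete, here is a minimal sketch of the capitalized-phrase pattern applied to a single sample line. It uses base R's `regmatches()` with `perl = TRUE` rather than stringr so that it runs without extra packages. The names `sample_line` and `caps_pattern` are illustrative and not part of the analysis itself.

```r
# One sample line in the style of the lyrics (illustrative input)
sample_line <- "I've been to Reno, Chicago, Fargo, Minnesota"

# One or more capitalized words, each optionally followed by a single space
caps_pattern <- "([A-Z][a-z'\\-]+ ?)+"

# gregexpr() locates every match; regmatches() extracts the matched text
matches <- regmatches(sample_line,
                      gregexpr(caps_pattern, sample_line, perl = TRUE))[[1]]
candidates <- trimws(matches)
candidates
```

The pattern also picks up "I've", which is why a stop-word filter is needed as a third step.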

#I read in the lyrics from a text file as a character vector with a string for each line
  setwd("~/Documents/Job Search/Code Sample")
  lyrics <- readLines("./I've Been Everywhere")

#Let's extract n-grams from the song
  #Eliminating unnecessary lines (first assumption)
    lines <- lyrics[lyrics != ""] #We can eliminate blank lines
    lines <- lines[! lines %in% lines[duplicated(lines)]] #eliminate every line that appears more than once
    
  #Pulling out capitalized phrases (second assumption)
    library(stringr)
    phrases <- str_extract_all(lines, pattern = "([A-Z][a-z'\\-]+ ?)+") 
    phrases <- unlist(phrases)
    phrases <- trimws(phrases)

    
  #Eliminating place names that include stop words (third assumption)
    library(stopwords)
    regex_stopwords <- paste(paste0(" ",paste(stopwords(), collapse = " | ")," "),
                             paste0("^",paste(stopwords(), collapse = " |^")," "),
                             paste0(" ",paste(stopwords(), collapse = "$| "),"$"),
                             paste0("^",paste(stopwords(), collapse = "$|^"),"$"),
                             sep = "|")
                                #^ these pastes build one long regex that captures stop words
                                #  in the middle, at the start, at the end, or as the whole phrase
                             
    places <- phrases[!grepl(regex_stopwords, tolower(phrases))] #removes phrases with stop words
    
  #Let's look at our list
    places <- unique(places) #drop repeated place names
    places
##  [1] "Winnemucca"    "Mack"          "Listen"        "Reno"         
##  [5] "Chicago"       "Fargo"         "Minnesota"     "Buffalo"      
##  [9] "Toronto"       "Winslow"       "Sarasota"      "Wichita"      
## [13] "Tulsa"         "Ottawa"        "Oklahoma"      "Tampa"        
## [17] "Panama"        "Mattawa"       "La Paloma"     "Bangor"       
## [21] "Baltimore"     "Salvador"      "Amarillo"      "Tocopilla"    
## [25] "Barranquilla"  "Padilla"       "Boston"        "Charleston"   
## [29] "Dayton"        "Louisiana"     "Washington"    "Houston"      
## [33] "Kingston"      "Texarkana"     "Monterey"      "Faraday"      
## [37] "Santa Fe"      "Tallapoosa"    "Glen Rock"     "Black Rock"   
## [41] "Little Rock"   "Oskaloosa"     "Tennessee"     "Hennessey"    
## [45] "Chicopee"      "Spirit Lake"   "Grand Lake"    "Devil's Lake" 
## [49] "Crater Lake"   "Pete's"        "Louisville"    "Nashville"    
## [53] "Knoxville"     "Ombabika"      "Schefferville" "Jacksonville" 
## [57] "Waterville"    "Costa Rica"    "Pittsfield"    "Springfield"  
## [61] "Bakersfield"   "Shreveport"    "Hackensack"    "Cadillac"     
## [65] "Fond"          "Lac"           "Davenport"     "Idaho"        
## [69] "Jellico"       "Argentina"     "Diamantina"    "Pasadena"     
## [73] "Catalina"      "Pittsburgh"    "Parkersburg"   "Gravelbourg"  
## [77] "Colorado"      "Ellensburg"    "Rexburg"       "Vicksburg"    
## [81] "El Dorado"     "Larimore"      "Admore"        "Haverstraw"   
## [85] "Chatanika"     "Chaska"        "Nebraska"      "Alaska"       
## [89] "Opelika"       "Baraboo"       "Waterloo"      "Kalamazoo"    
## [93] "Kansas City"   "Sioux City"    "Cedar City"    "Dodge City"
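The stop-word filter can be illustrated on a handful of phrases. This is a sketch under stated assumptions: `toy_stopwords` is a short hand-picked list standing in for the full list returned by `stopwords()`, and `toy_phrases` is invented input. It builds a compact pattern with the same intent as the four-part regex (a stop word at the start, in the middle, at the end, or as the whole phrase) and filters with `grepl()`.

```r
# Hand-picked stand-ins (assumptions, not the real stop-word list or lyrics)
toy_stopwords <- c("i've", "the", "of", "been")
toy_phrases   <- c("I've", "Reno", "Santa Fe", "King of the Road")

# A stop word counts only as a whole word: preceded by start-of-string or a space,
# and followed by a space or end-of-string
toy_regex <- paste0("(^| )(", paste(toy_stopwords, collapse = "|"), ")( |$)")

has_stopword <- grepl(toy_regex, tolower(toy_phrases))
toy_kept <- toy_phrases[!has_stopword]
toy_kept
```

Filtering with a logical vector has the convenient property that, when no phrase contains a stop word, every phrase is kept.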

It looks like a systematic handling of the song lyrics, aided by the stated assumptions, did a decent job of pulling place names out of the song. However, there were some errors: two false positives and one false negative. Such is the messy reality of text data.

“Listen” and “Pete’s” were retained. They fit my assumptions even though they are not place names. “Fond du Lac” was not included. It failed to meet my assumptions even though it is a place name: the lowercase “du” split it into “Fond” and “Lac”. I manually correct these errors below.

#Removing false positives
  places <- places[! places %in% c("Listen", "Pete's")]

#Adding the place name Fond du Lac
  places <- places[! places %in% c("Fond","Lac")]
  places <- c(places, "Fond du Lac")

Geocoding

Place names are not very helpful or particularly interesting on their own; I want data about these places. A geocoding API lets us pull location data using the place names, as if we were searching Google Maps for each location. I will use the Google Maps API to get information about the locations we have extracted from the song lyrics.

I will write my own geocoding function to do so. There is a preexisting package, ggmap, with functions to interact with the Google Maps API; however, the function it includes is inflexible and its method of authentication is outdated.
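Before the full function, it may help to see what a single request looks like. The sketch below only assembles a request URL and makes no network call. `build_geocode_url` is a hypothetical helper, `"YOUR_KEY"` is a placeholder, and both the use of `URLencode()` for escaping and the `region` query parameter reflect my reading of the Google Geocoding API documentation rather than anything guaranteed here.

```r
# Hypothetical helper: assemble a Geocoding API request URL for one place name
build_geocode_url <- function(place, region = "us", key = "YOUR_KEY") {
  paste0("https://maps.googleapis.com/maps/api/geocode/json?",
         "address=", utils::URLencode(place, reserved = TRUE),  # percent-encode spaces etc.
         "&region=", region,                                    # bias results toward a region
         "&key=", key)                                          # placeholder, not a real key
}

url <- build_geocode_url("Santa Fe")
url
```

Passing a vector of such URLs to a vectorized `fromJSON()` is then enough to fetch one response per place.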

library(jsonlite)
vector_fromJSON <- Vectorize(fromJSON, SIMPLIFY = FALSE)
  
#Now let's write a geocoding function that returns a data frame
  geocode <- function(locations, 
                      regioncode = 'us', 
                      APIkey){
                             
      json_locations <- gsub(' ', '+', locations)
                                    
      response <- vector_fromJSON(paste0('https://maps.googleapis.com/maps/api/geocode/json?',
                                         'address=', 
                                         json_locations,
                                         '&region=', regioncode, #the API's region-biasing parameter is named 'region'
                                         '&key=', APIkey))
                              
      response <- unlist(response, recursive = F)
      status <- response[grep('.status$',names(response))]
      status <- unlist(status)
      results <- response[grep('.results$',names(response))]
                              
      lat       <- as.numeric(unlist(lapply(results, function(x){x$geometry$location$lat[1]})))
      lng       <- as.numeric(unlist(lapply(results, function(x){x$geometry$location$lng[1]})))
                              
      address <- lapply(results, 
                        function(x){address = data.frame(x$address_components[[1]]$long_name, 
                                                         unlist(lapply(x$address_components[[1]]$types, 
                                                                       paste, 
                                                                       collapse = ", ")),
                                                         stringsAsFactors = FALSE)})
      address <- lapply(address,
                        function(x){names(x)<-c("comps", "types");x})
                              
      country   <- unlist(lapply(address, 
                                 function(x){ifelse(length(x$comps[grep('country', 
                                                                        x$types)]) == 1, 
                                                    x$comps[grep('country', x$types)],
                                                    NA)}))
      
      state   <- unlist(lapply(address, 
                               function(x){ifelse(length(x$comps[grep('administrative_area_level_1', 
                                                                      x$types
                                                                      )]) == 1, 
                                           x$comps[grep('administrative_area_level_1', 
                                                        x$types)],
                                            NA)}))
      
      county   <- unlist(lapply(address, 
                                function(x){ifelse(length(x$comps[grep('administrative_area_level_2', 
                                                                       x$types)]) == 1, 
                                            x$comps[grep('administrative_area_level_2', x$types)],
                                            NA)}))
      
      locality <- unlist(lapply(address, 
                                function(x){ifelse(length(x$comps[grep('locality', 
                                                                       x$types)]) == 1, 
                                            x$comps[grep('locality', x$types)],
                                            NA)}))

      df <- data.frame(locations,
                       lat, 
                       lng, 
                       locality, 
                       county, 
                       state, 
                       country, 
                       status, 
                       stringsAsFactors = F)
      
      names(df) <- c("locations", 
                     "lat",
                     "lng",
                     "locality",
                     "county", 
                     "state",
                     "country", 
                     "geocode_status")
      
      rownames(df) <- seq_len(nrow(df))
                              
      return(df)
  }
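The four address lookups in the function follow the same pattern: find the component whose type string mentions a given label, and return `NA` unless exactly one component matches. One possible way to express that once is sketched below. `get_component` is a hypothetical helper, and `toy_address` is a hand-built stand-in for one element of `address`, not real API output.

```r
# Hypothetical helper: return the single component whose types mention `type`, else NA
get_component <- function(address, type) {
  hit <- address$comps[grepl(type, address$types, fixed = TRUE)]
  if (length(hit) == 1) hit else NA_character_
}

# Hand-built example with the same two columns the function creates
toy_address <- data.frame(
  comps = c("Reno", "Washoe County", "Nevada", "United States"),
  types = c("locality, political",
            "administrative_area_level_2, political",
            "administrative_area_level_1, political",
            "country, political"),
  stringsAsFactors = FALSE
)

toy_locality <- get_component(toy_address, "locality")
toy_state    <- get_component(toy_address, "administrative_area_level_1")
toy_postal   <- get_component(toy_address, "postal_code")  # absent, so NA
```

Inside the function, each of `country`, `state`, `county`, and `locality` could then be built with a single `sapply(address, get_component, type = ...)` call.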

Now I can run my function on all the places we extracted from the song.

places <- geocode(locations = places, APIkey = Noahs_APIkey)

View(places)
locations | lat | lng | locality | county | state | country | geocode_status
--- | --- | --- | --- | --- | --- | --- | ---
Winnemucca | 40.972958 | -117.73568 | Winnemucca | Humboldt County | Nevada | United States | OK
Mack | 36.077470 | -96.05270 | Tulsa | Tulsa County | Oklahoma | United States | OK
Reno | 39.529633 | -119.81380 | Reno | Washoe County | Nevada | United States | OK
Chicago | 41.878114 | -87.62980 | Chicago | Cook County | Illinois | United States | OK
Fargo | 46.877186 | -96.78980 | Fargo | Cass County | North Dakota | United States | OK
Minnesota | 46.729553 | -94.68590 | NA | NA | Minnesota | United States | OK
Buffalo | 42.886447 | -78.87837 | Buffalo | Erie County | New York | United States | OK
Toronto | 43.653226 | -79.38318 | Toronto | Toronto Division | Ontario | Canada | OK
Winslow | 35.024187 | -110.69736 | Winslow | Navajo County | Arizona | United States | OK
Sarasota | 27.336435 | -82.53065 | Sarasota | Sarasota County | Florida | United States | OK
Wichita | 37.687176 | -97.33005 | Wichita | Sedgwick County | Kansas | United States | OK
Tulsa | 36.153982 | -95.99277 | Tulsa | Tulsa County | Oklahoma | United States | OK
Ottawa | 45.421530 | -75.69719 | Ottawa | Ottawa Division | Ontario | Canada | OK
Oklahoma | 35.007752 | -97.09288 | NA | NA | Oklahoma | United States | OK
Tampa | 27.950575 | -82.45718 | Tampa | Hillsborough County | Florida | United States | OK
Panama | 8.537981 | -80.78213 | NA | NA | NA | Panama | OK
Mattawa | 46.737910 | -119.90282 | Mattawa | Grant County | Washington | United States | OK
La Paloma | 37.179686 | -93.23469 | Springfield | Greene County | Missouri | United States | OK
Bangor | 44.801613 | -68.77123 | Bangor | Penobscot County | Maine | United States | OK
Baltimore | 39.290385 | -76.61219 | Baltimore | NA | Maryland | United States | OK
Salvador | -12.977749 | -38.50163 | NA | Salvador | State of Bahia | Brazil | OK
Amarillo | 35.221997 | -101.83130 | Amarillo | Potter County | Texas | United States | OK
Tocopilla | -22.085798 | -70.19301 | Tocopilla | Tocopilla Province | Antofagasta Region | Chile | OK
Barranquilla | 11.004107 | -74.80698 | Barranquilla | Barranquilla | Atlantico | Colombia | OK
Padilla | 44.977238 | -93.25094 | Minneapolis | Hennepin County | Minnesota | United States | OK
Boston | 42.360082 | -71.05888 | Boston | Suffolk County | Massachusetts | United States | OK
Charleston | 32.776475 | -79.93105 | Charleston | Charleston County | South Carolina | United States | OK
Dayton | 39.758948 | -84.19161 | Dayton | Montgomery County | Ohio | United States | OK
Louisiana | 30.984298 | -91.96233 | NA | NA | Louisiana | United States | OK
Washington | 47.751074 | -120.74014 | NA | NA | Washington | United States | OK
Houston | 29.760427 | -95.36980 | Houston | Harris County | Texas | United States | OK
Kingston | 44.231172 | -76.48595 | Kingston | Frontenac County | Ontario | Canada | OK
Texarkana | 33.425125 | -94.04769 | Texarkana | Bowie County | Texas | United States | OK
Monterey | 36.600238 | -121.89468 | Monterey | Monterey County | California | United States | OK
Faraday | 33.863471 | -118.28248 | Gardena | Los Angeles County | California | United States | OK
Santa Fe | 35.686975 | -105.93780 | Santa Fe | Santa Fe County | New Mexico | United States | OK
Tallapoosa | 33.744550 | -85.28801 | Tallapoosa | Haralson County | Georgia | United States | OK
Glen Rock | 40.962876 | -74.13292 | Glen Rock | Bergen County | New Jersey | United States | OK
Black Rock | 36.094004 | -95.86761 | Tulsa | Tulsa County | Oklahoma | United States | OK
Little Rock | 34.746481 | -92.28959 | Little Rock | Pulaski County | Arkansas | United States | OK
Oskaloosa | 41.291673 | -92.64936 | Oskaloosa | Mahaska County | Iowa | United States | OK
Tennessee | 35.517491 | -86.58045 | NA | NA | Tennessee | United States | OK
Hennessey | 29.749978 | -96.28326 | Sealy | Austin County | Texas | United States | OK
Chicopee | 42.148704 | -72.60787 | Chicopee | Hampden County | Massachusetts | United States | OK
Spirit Lake | 46.274303 | -122.13371 | NA | Skamania County | Washington | United States | OK
Grand Lake | 40.252207 | -105.82307 | Grand Lake | Grand County | Colorado | United States | OK
Devil's Lake | 43.418397 | -89.73095 | NA | Sauk County | Wisconsin | United States | OK
Crater Lake | 42.944587 | -122.10900 | NA | Klamath County | Oregon | United States | OK
Louisville | 38.252665 | -85.75846 | Louisville | Jefferson County | Kentucky | United States | OK
Nashville | 36.162664 | -86.78160 | Nashville | Davidson County | Tennessee | United States | OK
Knoxville | 35.960638 | -83.92074 | Knoxville | Knox County | Tennessee | United States | OK
Ombabika | 50.233333 | -87.90000 | Ombabika | Thunder Bay District | Ontario | Canada | OK
Schefferville | 54.824559 | -66.81748 | Schefferville | Sept-Rivières—Caniapiscau | Quebec | Canada | OK
Jacksonville | 30.332184 | -81.65565 | Jacksonville | Duval County | Florida | United States | OK
Waterville | 44.552011 | -69.63171 | Waterville | Kennebec County | Maine | United States | OK
Costa Rica | 9.748917 | -83.75343 | NA | NA | NA | Costa Rica | OK
Pittsfield | 42.450085 | -73.24538 | Pittsfield | Berkshire County | Massachusetts | United States | OK
Springfield | 37.208957 | -93.29230 | Springfield | Greene County | Missouri | United States | OK
Bakersfield | 35.373292 | -119.01871 | Bakersfield | Kern County | California | United States | OK
Shreveport | 32.525152 | -93.75018 | Shreveport | Caddo Parish | Louisiana | United States | OK
Hackensack | 40.885933 | -74.04347 | Hackensack | Bergen County | New Jersey | United States | OK
Cadillac | 36.105848 | -95.88572 | Tulsa | Tulsa County | Oklahoma | United States | OK
Davenport | 28.161405 | -81.60174 | Davenport | Polk County | Florida | United States | OK
Idaho | 44.068202 | -114.74204 | NA | NA | Idaho | United States | OK
Jellico | 36.587859 | -84.12687 | Jellico | Campbell County | Tennessee | United States | OK
Argentina | -38.416097 | -63.61667 | NA | NA | NA | Argentina | OK
Diamantina | -18.175990 | -43.71425 | NA | Diamantina | State of Minas Gerais | Brazil | OK
Pasadena | 34.147785 | -118.14452 | Pasadena | Los Angeles County | California | United States | OK
Catalina | 33.387886 | -118.41631 | NA | Los Angeles County | California | United States | OK
Pittsburgh | 40.440625 | -79.99589 | Pittsburgh | Allegheny County | Pennsylvania | United States | OK
Parkersburg | 39.266742 | -81.56151 | Parkersburg | Wood County | West Virginia | United States | OK
Gravelbourg | 49.875676 | -106.55732 | Gravelbourg | Division No. 3 | Saskatchewan | Canada | OK
Colorado | 39.550051 | -105.78207 | NA | NA | Colorado | United States | OK
Ellensburg | 46.996514 | -120.54785 | Ellensburg | Kittitas County | Washington | United States | OK
Rexburg | 43.823110 | -111.79242 | Rexburg | Madison County | Idaho | United States | OK
Vicksburg | 32.352646 | -90.87788 | Vicksburg | Warren County | Mississippi | United States | OK
El Dorado | 37.816756 | -96.88700 | El Dorado | Butler County | Kansas | United States | OK
Larimore | 47.906657 | -97.62675 | Larimore | Grand Forks County | North Dakota | United States | OK
Admore | 42.643147 | -82.85890 | Macomb | Macomb County | Michigan | United States | OK
Haverstraw | 41.197595 | -73.96458 | Haverstraw | Rockland County | New York | United States | OK
Chatanika | 65.111221 | -147.46539 | Chatanika | Fairbanks North Star | Alaska | United States | OK
Chaska | 44.789345 | -93.60184 | Chaska | Carver County | Minnesota | United States | OK
Nebraska | 41.492537 | -99.90181 | NA | NA | Nebraska | United States | OK
Alaska | 64.200841 | -149.49367 | NA | NA | Alaska | United States | OK
Opelika | 32.645412 | -85.37828 | Opelika | Lee County | Alabama | United States | OK
Baraboo | 43.471094 | -89.74429 | Baraboo | Sauk County | Wisconsin | United States | OK
Waterloo | 35.725330 | -97.47810 | Edmond | Oklahoma County | Oklahoma | United States | OK
Kalamazoo | 42.291707 | -85.58723 | Kalamazoo | Kalamazoo County | Michigan | United States | OK
Kansas City | 39.099727 | -94.57857 | Kansas City | Jackson County | Missouri | United States | OK
Sioux City | 42.496342 | -96.40494 | Sioux City | Woodbury County | Iowa | United States | OK
Cedar City | 37.677477 | -113.06189 | Cedar City | Iron County | Utah | United States | OK
Dodge City | 37.752798 | -100.01708 | Dodge City | Ford County | Kansas | United States | OK
Fond du Lac | 43.773045 | -88.44705 | Fond du Lac | Fond du Lac County | Wisconsin | United States | OK

Mapping “Everywhere”

#Now let's use leaflet to plot
#(localities, country_shapes and state_shapes are assumed to have been created beforehand)
    library(leaflet)
    map <- leaflet(width = '100%', 
                   options = leafletOptions())
    map <- addProviderTiles(map,
                            providers$CartoDB.Positron)
    map <- setView(map, lng = -102, lat = 30, zoom = 2)
    #Markers for the places that resolved to a locality
    map <- addMarkers(map, 
                      data = places[localities,],
                      lng = ~lng, 
                      lat = ~lat,
                      label = ~locations
                    )
    #Countries named in the song
    map <- addPolygons(map, 
                       data = country_shapes,
                       fillColor = "Orange",
                       weight = 2,
                       opacity = 1,
                       color = "Orange",
                       dashArray = "3",
                       fillOpacity = 0.7,
                       label = ~name_long,
                       highlightOptions = highlightOptions(
                                   weight = 2,
                                   color = "white",
                                   dashArray = "3",
                                   fillOpacity = 0.75,
                                   bringToFront = TRUE)
                     )    
    #States named in the song
    map <- addPolygons(map, 
                       data = state_shapes,
                       fillColor = "Green",
                       weight = 2,
                       opacity = 1,
                       color = "Green",
                       dashArray = "3",
                       label = ~NAME,
                       fillOpacity = 0.7,
                       highlightOptions = highlightOptions(
                                   weight = 2,
                                   color = "white",
                                   dashArray = "3",
                                   fillOpacity = 0.75,
                                   bringToFront = TRUE)
                     )    
 
    map 

The song certainly lists a lot of places. However, Johnny Cash claims to have been everywhere. Clearly, this song is not a comprehensive list.

If one wanted to determine, with some

Why has Cash been where he has been?

As one can see from the map above, Johnny Cash has been to plenty of places. He has not, however, been everywhere. This raises the question: why does Cash go where he goes? Below, this question is tackled with a logistic regression.

The level of analysis is US counties. Excellent county-by-county data is available for the US. Additionally, the bulk of places named in the song can be pinpointed to US counties. However, this means that the model has to take into account locations from the song that fall outside the US, or locations that include several counties, i.e. states.

The model uses three variables:

  1. Population. It seems likely that Cash will go where there are people for whom to perform.

  2. Percent of Population Employed in Cattle Ranching. Cash maintained an image as a cowboy. It seems likely he would want to be seen amongst real cowboys.

  3. Percent of Population Incarcerated. Cash famously performed at prisons around the country. It seems likely that he would have been where the prisons were.

County-by-county population data is taken from the ACS. The number of ranchers is taken from the EEO survey. The number of incarcerated people is taken from the census itself. Because the first two measures are estimates, they are not available for sparsely populated counties. We will only be considering counties for which we have data for all three variables. (Essentially, this is the largest third of US counties.)
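The data preparation amounts to joining the three tables on the county name, converting counts to shares of population, and flagging counties that match a geocoded place. The sketch below runs those steps on invented toy data: the county names and all numbers are made up for illustration, and the `toy_` object names are chosen so as not to clash with the real data.

```r
# Toy inputs (all names and values invented for illustration)
toy_pop    <- data.frame(Geography = c("A County, State X", "B County, State X", "C County, State Y"),
                         pop = c(1000, 2000, 4000))
toy_ranch  <- data.frame(Geography = c("A County, State X", "B County, State X"),
                         ranch = c(10, 40))
toy_prison <- data.frame(Geography = c("A County, State X", "B County, State X", "C County, State Y"),
                         prison = c(50, 20, 0))
toy_geocoded <- data.frame(county = c("B County", NA),
                           state  = c("State X", "State Y"))

# Inner joins keep only counties present in all three tables
toy_counties <- merge(toy_prison, merge(toy_pop, toy_ranch, by = "Geography"), by = "Geography")
toy_counties$prison <- toy_counties$prison / toy_counties$pop
toy_counties$ranch  <- toy_counties$ranch / toy_counties$pop

# A place with no county becomes the key "NA, State Y", which matches no county name
visited_keys <- paste0(toy_geocoded$county, ", ", toy_geocoded$state)
toy_counties$visited <- ifelse(toy_counties$Geography %in% visited_keys, 1, 0)
toy_counties
```

One consequence worth noting: state-level and country-level places have no county, so their keys never match and they are not counted as visits to any county.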

library(DescTools)
Population <-read.csv("./Population better.csv")
Ranchers <-read.csv("./Ranchers.csv")
Prisoners <-read.csv("./Prison Population.csv")

counties <- merge(Prisoners, merge(Population, Ranchers, by = 'Geography'), by = 'Geography')
names(counties)<-c('names','prison','pop','ranch')
counties$prison <- counties$prison/counties$pop
counties$ranch  <- counties$ranch/counties$pop

counties$visited <- ifelse(counties$names %in% paste0(places$county,", ",places$state), 1, 0)

model <- glm(visited ~ pop + prison + ranch, 
             data = counties, 
             family = binomial(link = "logit")
)
summary(model)
## 
## Call:
## glm(formula = visited ~ pop + prison + ranch, family = binomial(link = "logit"), 
##     data = counties)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2968  -0.3018  -0.2884  -0.2812   2.5633  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.329e+00  2.509e-01 -13.268  < 2e-16 ***
## pop          9.320e-07  2.212e-07   4.214 2.51e-05 ***
## prison      -5.051e-01  1.010e+01  -0.050    0.960    
## ranch        2.348e+01  6.256e+01   0.375    0.707    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 387.75  on 966  degrees of freedom
## Residual deviance: 364.88  on 963  degrees of freedom
## AIC: 372.88
## 
## Number of Fisher Scoring iterations: 6
DescTools::PseudoR2(model)
##   McFadden 
## 0.05897656

Looking at the distribution of residuals and the pseudo \(R^2\), it is clear this model has very little explanatory power. The McFadden pseudo \(R^2\) is particularly damning: only about six percent of the variation in the likelihood that Johnny Cash has been to a county is explained by the model’s chosen predictors. Additionally, only one of the predictors had a statistically significant effect: population. Note that although the effect of population is significant, it is also tiny. I can say with confidence that this model does not increase our understanding of where Cash has been.
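Both numbers can be checked by hand from the model summary. For a logistic regression on ungrouped 0/1 outcomes, McFadden's pseudo \(R^2\) equals one minus the ratio of residual to null deviance, and the size of the population effect is easier to judge as an odds ratio per 100,000 residents. The sketch below uses only the rounded figures printed in the summary; the choice of 100,000 as the unit is mine.

```r
# Deviances as printed by summary(model)
null_deviance     <- 387.75
residual_deviance <- 364.88

# McFadden's pseudo R-squared from the two deviances
mcfadden <- 1 - residual_deviance / null_deviance
round(mcfadden, 3)          # about 0.059, matching PseudoR2() up to rounding

# Population coefficient as printed by summary(model), in log-odds per resident
beta_pop <- 9.320e-07

# Multiplicative change in the odds for 100,000 additional residents
odds_ratio_100k <- exp(beta_pop * 1e5)
round(odds_ratio_100k, 3)   # about 1.098
```

In other words, an extra 100,000 residents raises the odds of a county being named by roughly ten percent.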

It makes sense that this model would have little explanatory power. Johnny Cash did not actually write the song, and the person who did likely did not actually travel to all these places. Places were likely chosen based on their names rather than any tangible characteristic.